A Very Fast Method for Clustering Big Text Datasets

Authors

Frank Lin

William W. Cohen

Abstract

Large-scale text datasets have long eluded a family of particularly elegant and effective clustering methods that exploit the power of pair-wise similarities between data points, because operating on a similarity matrix is prohibitively costly in both time and space: the state of the art is at best quadratic in time and in space. We present an extremely fast and simple method that also uses the power of all pair-wise similarities between data points, and show through experiments that it does as well as previous methods in clustering accuracy, and that it does so in linear time and space, without sampling data points or sparsifying the similarity matrix.

Similar resources

Modification of the Fast Global K-means Using a Fuzzy Relation with Application in Microarray Data Analysis

Recognizing genes with distinctive expression levels can help in prevention, diagnosis and treatment of the diseases at the genomic level. In this paper, fast Global k-means (fast GKM) is developed for clustering the gene expression datasets. Fast GKM is a significant improvement of the k-means clustering method. It is an incremental clustering method which starts with one cluster. Iteratively ...

Fast Data Clustering and Outlier Detection Using K-means Clustering on Apache Spark

The components forming the information society nowadays are seen in all areas of our lives. As computers have a great deal of importance in our lives, the amount of information has begun to gather meaningful and specific qualities. Not only the amount of information is increased, but also the speed of access to information has increased. Large data is the transformed form of all data recovered ...

A Comparative Study of Some Clustering Algorithms on Shape Data

Recently, some statistical studies have been done using the shape data. One of these studies is clustering shape data, which is the main topic of this paper. We are going to study some clustering algorithms on shape data and then introduce the best algorithm based on accuracy, speed, and scalability criteria. In addition, we propose a method for representing the shape data that facilitates and ...

A Simple Yet Fast Clustering Approach for Categorical Data

Categorical data has always posed a challenge in data analysis through clustering. With the increasing awareness about Big data analysis, the need for better clustering methods for categorical data and mixed data has arisen. The prevailing clustering algorithms are not suitable for clustering categorical data majorly because the distance functions used for continuous data are not applicable for...

A New Approach for Text Documents Classification with Invasive Weed Optimization and Naive Bayes Classifier

With the fast increase of the documents, using Text Document Classification (TDC) methods has become a crucial matter. This paper presented a hybrid model of Invasive Weed Optimization (IWO) and Naive Bayes (NB) classifier (IWO-NB) for Feature Selection (FS) in order to reduce the big size of features space in TDC. TDC includes different actions such as text processing, feature extraction, form...

Publication date: 2010
